﻿WEBVTT

00:00:14.752 --> 00:00:21.696
- All right, welcome to lecture nine. So today
we will be talking about CNN architectures.

00:00:21.696 --> 00:00:27.706
And just a few administrative points before we
get started, assignment two is due Thursday.

00:00:27.706 --> 00:00:36.855
The midterm will be in class on Tuesday, May ninth, so next week, and it
will cover material through this coming Thursday, May fourth.

00:00:36.855 --> 00:00:41.350
So everything up to recurrent neural
networks is going to be fair game.

00:00:41.350 --> 00:00:49.121
The poster session, we've decided on a time: it's going to be Tuesday, June
sixth, from twelve to three p.m. So this is the last week of classes.

00:00:49.121 --> 00:00:53.828
So we have our poster session a little bit
early during the last week so that after that,

00:00:53.828 --> 00:01:00.132
once you guys get feedback you still have some time to
work for your final report which will be due finals week.

00:01:03.325 --> 00:01:05.812
Okay, so just a quick review of last time.

00:01:05.812 --> 00:01:09.324
Last time we talked about different
kinds of deep learning frameworks.

00:01:09.324 --> 00:01:12.690
We talked about you know
PyTorch, TensorFlow, Caffe2

00:01:14.514 --> 00:01:18.762
and we saw that using these kinds of frameworks we
were able to easily build big computational graphs,

00:01:18.762 --> 00:01:25.784
for example very large neural networks and conv nets, and
be able to really easily compute gradients in these graphs.

00:01:25.784 --> 00:01:32.415
So to compute all of the gradients for all the intermediate
variables, weights, and inputs, and use that to train our models,

00:01:32.415 --> 00:01:35.665
and to run all this efficiently on GPUs

00:01:37.658 --> 00:01:44.978
And we saw that for a lot of these frameworks, the way this works is by working
with these modularized layers that you guys have been writing as well,

00:01:44.978 --> 00:01:49.928
in your homeworks, where we have
a forward pass, we have a backward pass,

00:01:49.928 --> 00:01:58.404
and then in our final model architecture, all we need to do
then is to just chain this sequence of layers together.

00:01:58.404 --> 00:02:04.937
So using that, we're able to very easily
build up very complex network architectures.

00:02:06.626 --> 00:02:14.520
So today we're going to talk about some specific kinds of CNN Architectures
that are used today in cutting edge applications and research.

00:02:14.520 --> 00:02:19.631
And so we'll go into depth in some of the most
commonly used architectures for these that are winners

00:02:19.631 --> 00:02:22.125
of ImageNet classification benchmarks.

00:02:22.125 --> 00:02:28.085
So in chronological order AlexNet,
VGG net, GoogLeNet, and ResNet.

00:02:28.085 --> 00:02:43.771
And so we'll go into a lot of depth on these. And then after that, I'll also briefly go through some other architectures that are
not as prominently used these days, but are interesting either from a historical perspective, or as recent areas of research.

00:02:46.822 --> 00:02:50.839
Okay, so just a quick review.
We talked a long time ago about LeNet,

00:02:50.839 --> 00:02:55.603
which was one of the first instantiations of a
ConvNet that was successfully used in practice.

00:02:55.603 --> 00:03:05.778
And so this was the ConvNet that took an input image, used conv filters, five
by five filters applied at stride one, and had a couple of conv layers,

00:03:05.778 --> 00:03:09.335
a few pooling layers and then some
fully connected layers at the end.

00:03:09.335 --> 00:03:14.320
And this fairly simple ConvNet was very
successfully applied to digit recognition.

00:03:17.030 --> 00:03:22.875
So AlexNet from 2012 which you guys have also
heard already before in previous classes,

00:03:22.875 --> 00:03:31.179
was the first large scale convolutional neural network
that was able to do well on the ImageNet classification

00:03:31.179 --> 00:03:40.611
task so in 2012 AlexNet was entered in the competition, and was able to
outperform all previous non deep learning based models by a significant margin,

00:03:40.611 --> 00:03:48.012
and so this was the ConvNet that started the
spree of ConvNet research and usage afterwards.

00:03:48.012 --> 00:03:56.427
And so the basic ConvNet AlexNet architecture is a conv layer
followed by a pooling layer, then normalization: conv, pool, norm,

00:03:58.421 --> 00:04:01.006
and then a few more conv
layers, a pooling layer,

00:04:01.006 --> 00:04:03.422
and then several fully
connected layers afterwards.

00:04:03.422 --> 00:04:09.766
So this actually looks very similar to the LeNet network
that we just saw. There's just more layers in total.

00:04:09.766 --> 00:04:18.387
There are five of these conv layers, and two fully connected layers
before the final fully connected layer going to the output classes.

00:04:21.889 --> 00:04:25.930
So let's first get a sense of the
sizes involved in the AlexNet.

00:04:25.930 --> 00:04:33.128
So if we look at the input to the AlexNet this was trained
on ImageNet, with inputs at a size 227 by 227 by 3 images.

00:04:33.128 --> 00:04:43.193
And if we look at this first layer which is a conv layer for the
AlexNet, it's 11 by 11 filters, 96 of these applied at stride 4.

00:04:43.193 --> 00:04:49.323
So let's just think about this for a moment.
What's the output volume size of this first layer?

00:04:51.788 --> 00:04:53.371
And there's a hint.

00:04:57.769 --> 00:05:11.441
So remember we have our input size, we have our convolutional filters, right? And we have this formula,
which is the hint over here, that gives you the size of the output dimensions after applying conv, right?

00:05:11.441 --> 00:05:17.632
So remember it was the full image, minus the
filter size, divided by the stride, plus one.

00:05:17.632 --> 00:05:26.919
So given that that's written up here for you, does anyone have
a guess at what's the final output size after this conv layer?

00:05:26.919 --> 00:05:29.823
[student speaks off mic]

00:05:29.823 --> 00:05:32.966
- So I had 55 by 55 by 96, yep.
That's correct.

00:05:32.966 --> 00:05:38.113
Right so our spatial dimensions at the output are
going to be 55 in each dimension and then we have

00:05:38.113 --> 00:05:45.391
96 total filters so the depth after our conv layer
is going to be 96. So that's the output volume.
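
As an aside, the output-size formula from the slide can be sketched in a few lines of Python (the function name here is my own, just for illustration):

```python
def conv_output_size(input_size, filter_size, stride, pad=0):
    # (N - F + 2P) / S + 1, applied per spatial dimension
    return (input_size - filter_size + 2 * pad) // stride + 1

# AlexNet CONV1: 227x227x3 input, 96 filters of 11x11 applied at stride 4
spatial = conv_output_size(227, 11, 4)
print(spatial)  # 55, so the output volume is 55 x 55 x 96
```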

00:05:45.391 --> 00:05:49.486
And what's the total number
of parameters in this layer?

00:05:49.486 --> 00:05:52.819
So remember we have 96 11 by 11 filters.

00:05:54.851 --> 00:05:57.753
[student speaks off mic]

00:05:57.753 --> 00:06:00.753
- [Lecturer] 96 by 11 by 11, almost.

00:06:01.945 --> 00:06:05.297
So yes, so I had another by three,
yes that's correct.

00:06:05.297 --> 00:06:13.632
So each of the filters is going to see through a local region
of 11 by 11 by three, right, because the input depth was three.

00:06:13.632 --> 00:06:18.983
And so, that's each filter size,
times we have 96 of these total.

00:06:18.983 --> 00:06:23.150
And so there's 35K parameters
in this first layer.
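
That parameter count can be checked directly (a quick sketch; biases are ignored here, matching the lecture's roughly 35K figure):

```python
# AlexNet CONV1: each filter spans its full input depth, so 11 x 11 x 3 weights
filter_h, filter_w, input_depth, num_filters = 11, 11, 3, 96
params = filter_h * filter_w * input_depth * num_filters
print(params)  # 34848, i.e. roughly 35K
```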

00:06:26.018 --> 00:06:30.233
Okay, so now if we look at the second layer,
this is a pooling layer, right, and in this case

00:06:30.233 --> 00:06:34.004
we have three by three
filters applied at stride two.

00:06:34.004 --> 00:06:38.171
So what's the output volume
of this layer after pooling?

00:06:40.701 --> 00:06:44.868
And again we have a hint, very
similar to the last question.

00:06:51.251 --> 00:06:56.267
Okay, 27 by 27 by 96.
Yes that's correct.

00:06:57.716 --> 00:07:01.528
Right so the pooling layer is basically
going to use this formula that we had here.

00:07:01.528 --> 00:07:16.655
Again, because this is pooling applied at a stride of two, we're going to use the same formula to determine
the spatial dimensions. And so the spatial dimensions are going to be 27 by 27, and pooling preserves the depth.

00:07:16.655 --> 00:07:21.527
So we had a depth of 96 as input, and it's
still going to be 96 depth at output.
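
The same formula gives the pooling output, assuming no padding (a sketch):

```python
def pool_output_size(input_size, filter_size, stride):
    # pooling uses the same (N - F) / S + 1 rule, with no padding
    return (input_size - filter_size) // stride + 1

# AlexNet POOL1: 3x3 filters at stride 2 on the 55 x 55 x 96 volume
print(pool_output_size(55, 3, 2))  # 27; depth stays at 96, and there are no parameters
```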

00:07:22.825 --> 00:07:28.127
And next question. What's the
number of parameters in this layer?

00:07:31.446 --> 00:07:34.354
I hear some muttering.
[student answers off mic]

00:07:34.354 --> 00:07:36.905
- Nothing.
Okay.

00:07:36.905 --> 00:07:40.801
Yes, so pooling layer has no parameters,
so, kind of a trick question.

00:07:42.739 --> 00:07:45.272
Okay, so we can basically, yes, question?

00:07:45.272 --> 00:07:47.192
[student speaks off mic]

00:07:47.192 --> 00:07:52.180
- The question is, why are there no
parameters in the pooling layer?

00:07:52.180 --> 00:07:54.551
The parameters are the weights right,
that we're trying to learn.

00:07:54.551 --> 00:07:56.511
And so convolutional layers
have weights that we learn

00:07:56.511 --> 00:08:02.236
but pooling all we do is have a rule, we look
at the pooling region, and we take the max.

00:08:02.236 --> 00:08:05.710
So there's no parameters that are learned.

00:08:05.710 --> 00:08:14.250
So we can keep on doing this, and you can just repeat the process, and it's kind of a good
exercise to go through this and figure out the sizes and the parameters at every layer.

00:08:16.473 --> 00:08:22.688
And so if you do this all the way, you can see
the final architecture that you can work with.

00:08:22.688 --> 00:08:31.920
There's 11 by 11 filters at the beginning, then five by five and some three
by three filters. And so these are generally pretty familiar looking sizes

00:08:31.920 --> 00:08:39.122
that you've seen before and then at the end we have a couple of
fully connected layers of size 4096 and finally the last layer,

00:08:39.123 --> 00:08:41.540
is FC8 going to the softmax,

00:08:42.689 --> 00:08:46.356
which is going to the
1000 ImageNet classes.

00:08:48.039 --> 00:08:56.352
And just a couple of details about this: it was the first use of the ReLU
non-linearity that we've talked about, which is the most commonly used non-linearity.

00:08:56.352 --> 00:09:07.391
They used local response normalization layers, basically trying to normalize the response
across neighboring channels, but this is something that's not really used anymore.

00:09:07.391 --> 00:09:11.937
It turned out not to matter; other people showed
that it didn't have so much of an effect.

00:09:11.937 --> 00:09:21.769
There's a lot of heavy data augmentation, and so you can look in the paper for more details,
but things like flipping, jittering, color normalization, all of these things

00:09:21.769 --> 00:09:28.727
which you'll probably find useful for you when you're working on
your projects for example, so a lot of data augmentation here.

00:09:28.727 --> 00:09:32.419
They also used dropout, a batch size of 128,

00:09:32.419 --> 00:09:37.183
and learned with SGD with
momentum which we talked about

00:09:37.183 --> 00:09:42.295
in an earlier lecture, and basically just started
with a base learning rate of 1e negative 2.

00:09:42.295 --> 00:09:50.145
Every time it plateaued, they reduced it by a factor of 10, and
then just kept going until they finished training,

00:09:50.145 --> 00:09:59.012
and a little bit of weight decay, and in the end, in order to get the best numbers,
they also did an ensembling of models, so training multiple of these,

00:09:59.012 --> 00:10:03.162
averaging them together and this also
gives an improvement in performance.

00:10:04.405 --> 00:10:08.781
And so one other thing I want to point out is
that if you look at this AlexNet diagram up here,

00:10:08.781 --> 00:10:15.235
it looks kind of like the normal ConvNet diagrams
that we've been seeing, except for one difference,

00:10:15.235 --> 00:10:21.937
which is that it's, you can see it's kind of split
in these two different rows or columns going across.

00:10:23.177 --> 00:10:32.905
And so the reason for this is mostly a historical note: AlexNet was
trained on GTX 580 GPUs, older GPUs that only had three gigs of memory.

00:10:34.106 --> 00:10:37.255
So it couldn't actually fit
this entire network on here,

00:10:37.255 --> 00:10:41.773
and so what they ended up doing, was
they spread the network across two GPUs.

00:10:41.773 --> 00:10:46.455
So on each GPU you would have half of the
neurons, or half of the feature maps.

00:10:46.455 --> 00:10:51.730
And so for example if you look at this first
conv layer, we have 55 by 55 by 96 output,

00:10:54.389 --> 00:11:04.155
but if you look at this diagram carefully, you can zoom in later in the actual
paper, you can see that, it's actually only 48 depth-wise, on each GPU,

00:11:05.049 --> 00:11:08.593
and so they just split the
feature maps directly in half.

00:11:10.288 --> 00:11:17.367
And so what happens is that for most of these layers, for example conv
one, two, four and five, the connections are only with feature maps

00:11:17.367 --> 00:11:29.683
on the same GPU, so you would take as input half of the feature maps that were on the
same GPU as before, and you don't look at the full 96 feature maps for example.

00:11:29.683 --> 00:11:33.850
You just take as input the
48 in that first layer.

00:11:34.767 --> 00:11:47.696
And then there's a few layers, so conv three, as well as FC six, seven and eight, where the
GPUs do talk to each other, and so there's connections with all feature maps in the preceding layer.

00:11:47.696 --> 00:11:54.191
So there's communication across the GPUs, and each of these neurons
is then connected to the full depth of the previous input layer.

00:11:54.191 --> 00:11:55.627
Question.

00:11:55.627 --> 00:12:01.442
- [Student] It says the full simplified
AlexNet architecture. [mumbles]

00:12:05.583 --> 00:12:10.033
- Oh okay, so the question is why does it say
full simplified AlexNet architecture here?

00:12:10.033 --> 00:12:19.036
It just says that because I didn't put all the details on here, so for example
this is the full set of layers in the architecture, and the strides and so on,

00:12:19.036 --> 00:12:25.268
but for example the normalization layer and
other such details are not written on here.

00:12:30.637 --> 00:12:37.849
And then just one little note, if you look at the paper and
try and write out the math and architectures and so on,

00:12:38.858 --> 00:12:52.721
there's a little bit of an issue: on the very first layer, if you look in the figure, they'll say 224 by 224,
but there's actually something funny going on, and the numbers actually work out if you look at it as 227.

00:12:54.982 --> 00:13:04.261
AlexNet was the winner of the ImageNet classification benchmark in
2012, you can see that it cut the error rate by quite a large margin.

00:13:05.246 --> 00:13:14.193
It was the first CNN-based winner, and it was widely used as a base
architecture almost ubiquitously from then until a couple years ago.

00:13:15.720 --> 00:13:17.980
It's still used quite a bit.

00:13:17.980 --> 00:13:24.071
It's used in transfer learning for lots of different
tasks and so it was used for basically a long time,

00:13:24.071 --> 00:13:33.202
and it was very famous, and now, though, there have been some more recent architectures
that have generally just had better performance, and so we'll talk about these

00:13:33.202 --> 00:13:39.282
next and these are going to be the more common
architectures that you'll be wanting to use in practice.

00:13:40.853 --> 00:13:47.813
So just quickly first in 2013 the ImageNet
challenge was won by something called a ZFNet.

00:13:47.813 --> 00:13:48.718
Yes, question.

00:13:48.718 --> 00:13:52.729
[student speaks off mic]

00:13:52.729 --> 00:13:56.612
- So the question is the intuition for why AlexNet was
so much better than the ones that came before,

00:13:56.612 --> 00:14:04.786
deep learning ConvNets [mumbles], this is just a
very different kind of approach and architecture.

00:14:04.786 --> 00:14:09.004
So this was the first deep learning based
approach, the first ConvNet, that was used.

00:14:12.445 --> 00:14:18.298
So in 2013 the challenge was won by something called a
ZFNet [Zeiler-Fergus Net], named after the creators.

00:14:18.298 --> 00:14:23.749
And so this mostly was improving
hyperparameters over the AlexNet.

00:14:23.749 --> 00:14:35.735
It had the same number of layers and the same general structure, and they made a few changes, things like changing
the stride size and different numbers of filters, and after playing around with these hyperparameters more,

00:14:35.735 --> 00:14:41.369
they were able to improve the error rate.
But it's still basically the same idea.

00:14:41.369 --> 00:14:49.843
So in 2014 there are a couple of architectures that were now more
significantly different and made another jump in performance,

00:14:49.843 --> 00:14:58.178
and the main difference with these networks,
first of all, was that they were much deeper.

00:14:58.178 --> 00:15:12.321
So from the eight layer network that was in 2012 and 2013, now in 2014 we had two
very close winners that were around 19 layers and 22 layers. So significantly deeper.

00:15:12.321 --> 00:15:16.502
And the winner of this
was GoogLeNet, from Google,

00:15:16.502 --> 00:15:20.176
but very close behind was
something called VGGNet

00:15:20.176 --> 00:15:27.421
from Oxford, and actually on the localization challenge
VGG got first place, as well as in some of the other tracks.

00:15:27.421 --> 00:15:31.958
So these were both very,
very strong networks.

00:15:31.958 --> 00:15:34.663
So let's first look at VGG
in a little bit more detail.

00:15:34.663 --> 00:15:40.818
And so the idea behind the VGG network is much
deeper networks with much smaller filters.

00:15:40.818 --> 00:15:50.374
So they increased the number of layers from the eight layers in AlexNet,
right, to models with 16 to 19 layers in VGGNet.

00:15:52.290 --> 00:16:03.916
And one key thing that they did was they kept very small filters, so only three by three conv all the way,
which is basically the smallest conv filter size that is looking at a little bit of the neighboring pixels.

00:16:03.916 --> 00:16:11.485
And they just kept this very simple structure of three by three
convs with the periodic pooling all the way through the network.

00:16:11.485 --> 00:16:19.948
And this very simple, elegant network architecture was
able to get 7.3% top-five error on the ImageNet challenge.

00:16:22.651 --> 00:16:27.442
So first the question of
why use smaller filters.

00:16:27.442 --> 00:16:33.371
So when we take these small filters, we now have
fewer parameters, and we try to stack more of them

00:16:33.371 --> 00:16:39.344
instead of having larger filters; we have smaller filters
with more depth, more of these filters instead, and

00:16:39.344 --> 00:16:47.202
what happens is that you end up having the same effective receptive
field as if you only had one seven by seven convolutional layer.

00:16:47.202 --> 00:16:55.466
So here's a question, what is the effective receptive field
of three of these three by three conv layers with stride one?

00:16:55.466 --> 00:17:01.189
So if you were to stack three three by three conv layers
with stride one, what's the effective receptive field,

00:17:01.189 --> 00:17:09.754
the total spatial area of the input that
a neuron at the top of the three layers is looking at?

00:17:12.313 --> 00:17:15.987
So I heard fifteen pixels,
why fifteen pixels?

00:17:15.987 --> 00:17:20.609
- [Student] Because they overlap--

00:17:20.609 --> 00:17:27.369
- Okay, so the reason given was
because they overlap. So it's on the right track.

00:17:27.369 --> 00:17:35.668
What actually is happening though is you have to see, at the first
layer, the receptive field is going to be three by three right?

00:17:35.668 --> 00:17:43.193
And then at the second layer, each of these neurons in the second
layer is going to look at three by three first-layer

00:17:43.193 --> 00:17:51.676
filters, but the corners of that three by three are looking at an additional
pixel on each side in the original input layer.

00:17:51.676 --> 00:17:56.423
So the second layer is actually looking at five by
five receptive field and then if you do this again,

00:17:56.423 --> 00:18:04.040
the third layer is looking at three by three
in the second layer but this is going to,

00:18:04.040 --> 00:18:06.907
if you just draw out this pyramid, be looking
at seven by seven in the input layer.

00:18:06.907 --> 00:18:16.026
So the effective receptive field here is going to be seven by
seven. Which is the same as one seven by seven conv layer.
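
This growth of the receptive field, three to five to seven, can be sketched with a small helper (my own naming, just to illustrate the pyramid argument):

```python
def stacked_receptive_field(num_layers, filter_size=3):
    # with stride 1, each extra layer widens the field by (filter_size - 1)
    rf = 1
    for _ in range(num_layers):
        rf += filter_size - 1
    return rf

print([stacked_receptive_field(n) for n in (1, 2, 3)])  # [3, 5, 7]
```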

00:18:16.026 --> 00:18:21.546
So what happens is that this has the same effective receptive
field as a seven by seven conv layer but it's deeper.

00:18:21.546 --> 00:18:26.201
It's able to have more non-linearities in
there, and it also has fewer parameters.

00:18:26.201 --> 00:18:36.536
So if you look at the total number of parameters, each of these conv filters for
the three by threes is going to have nine parameters in each conv [mumbles]

00:18:38.165 --> 00:18:44.648
three times three, and then times the input depth, so
three times three times C, times this total number

00:18:44.648 --> 00:18:51.034
of output feature maps, which is again C, since we're
going to preserve the total number of channels.

00:18:51.034 --> 00:19:00.165
So you get three times three, times C times C for each of these layers,
and we have three layers so it's going to be three times this number,

00:19:00.165 --> 00:19:07.409
compared to if you had a single seven by seven layer then you
get, by the same reasoning, seven squared times C squared.

00:19:07.409 --> 00:19:11.032
So you're going to have fewer
parameters total, which is nice.
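
The comparison works out as follows, for any channel count C preserved through the layers (C = 256 here is just an arbitrary example):

```python
C = 256  # example channel count, preserved from input to output

stacked_3x3 = 3 * (3 * 3 * C * C)  # three 3x3xC conv layers with C filters each
single_7x7 = 7 * 7 * C * C         # one 7x7xC conv layer with C filters
print(stacked_3x3, single_7x7)     # 27*C^2 vs 49*C^2: fewer parameters when stacked
```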

00:19:15.570 --> 00:19:24.161
So now if we look at this full network here there's a lot of numbers up
here that you can go back and look at more carefully but if we look at all

00:19:24.161 --> 00:19:30.716
of the sizes and number of parameters the same
way that we calculated the example for AlexNet,

00:19:30.716 --> 00:19:32.517
this is a good exercise to go through,

00:19:32.517 --> 00:19:45.834
we can see that, going the same way, we have a couple of these conv layers and a pooling layer, a
couple more conv layers, a pooling layer, several more conv layers and so on. And so this just keeps going up.

00:19:45.834 --> 00:19:52.431
And if you counted the total number of convolutional and fully
connected layers, we're going to have 16 in this case for VGG 16,

00:19:52.431 --> 00:20:00.478
and then VGG 19, it's just a very similar architecture,
but with a few more conv layers in there.
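
The layer counting can be sketched from the commonly used shorthand for the VGG-16 configuration (channel numbers per conv layer, with 'M' marking a max pool; only weight layers count toward the 16):

```python
# VGG-16 configuration: conv output channels per layer, 'M' = 2x2 max pool
vgg16_cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

num_conv = sum(1 for v in vgg16_cfg if v != 'M')  # 13 conv layers
num_fc = 3  # FC6 and FC7 at 4096, plus the 1000-way FC8
print(num_conv + num_fc)  # 16 weight layers, hence "VGG-16"
```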

00:20:03.021 --> 00:20:05.605
And so the total memory
usage of this network,

00:20:05.605 --> 00:20:17.196
so just making a forward pass through and counting up all of these numbers, the
memory numbers here are written in terms of the total count of values, like we calculated earlier,

00:20:17.196 --> 00:20:23.125
and if you look at four bytes per number,
this is going to be about 100 megs per image,

00:20:23.125 --> 00:20:28.727
and so this is the scale of the memory usage that's
happening and this is only for a forward pass right,

00:20:28.727 --> 00:20:35.470
when you do a backward pass you're going to have to
store more and so this is pretty heavy memory wise.

00:20:35.470 --> 00:20:44.410
100 megs per image; if you only have five gigs of total memory,
then you're only going to be able to store about 50 of these.
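
As a back-of-the-envelope check, using the lecture's rough numbers (the 5 GB card is just the example figure from above):

```python
activations_mb_per_image = 100   # ~100 MB of forward-pass activations per image
gpu_memory_mb = 5 * 1024         # a card with 5 GB of total memory
print(gpu_memory_mb // activations_mb_per_image)  # ~51 images, before backward-pass storage
```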

00:20:47.300 --> 00:20:56.131
And so also the total number of parameters here we have is 138 million
parameters in this network, and this compares with 60 million for AlexNet.

00:20:56.131 --> 00:20:57.481
Question?

00:20:57.481 --> 00:21:00.898
[student speaks off mic]

00:21:06.204 --> 00:21:09.920
- So the question is what do we mean by deeper,
is it the number of filters, number of layers?

00:21:09.920 --> 00:21:14.087
So deeper in this case is
always referring to layers.

00:21:15.605 --> 00:21:25.216
So there are two usages of the word depth, which is confusing: one is the channel
depth of a volume, width by height by depth; you can use the word depth there,

00:21:26.942 --> 00:21:34.298
but in general when we talk about the depth of a network, this is going to
be the total number of layers in the network, and usually in particular

00:21:34.298 --> 00:21:43.368
we're counting the total number of weight layers, so the total number of layers
with trainable weights, so convolutional layers and fully connected layers.

00:21:43.368 --> 00:21:46.868
[student mumbles off mic]

00:22:00.810 --> 00:22:06.174
- Okay, so the question is, within each
layer, what do the different filters mean?

00:22:06.174 --> 00:22:13.043
And so we talked about this back in the ConvNet
lecture, so you can also go back and refer to that,

00:22:13.043 --> 00:22:27.616
but each filter is a set of weights, let's say three by three, so each filter is looking at a three
by three region across the full input depth, and this produces one feature map,

00:22:27.616 --> 00:22:31.954
one activation map of all the responses
of the different spatial locations.

00:22:31.954 --> 00:22:39.646
And then we can have as many filters as we want, right, so for
example 96, and each of these is going to produce a feature map.

00:22:39.646 --> 00:22:48.368
And so it's just like each filter corresponds to a different pattern that we're looking for
in the input that we convolve around and we see the responses everywhere in the input,

00:22:48.368 --> 00:22:56.181
we create a map of these, and then another filter we'll
convolve over the image and create another map.

00:22:58.761 --> 00:23:00.226
Question.

00:23:00.226 --> 00:23:03.643
[student speaks off mic]

00:23:07.465 --> 00:23:16.733
- So the question is, is there intuition behind why, as you go deeper into the network,
we have more channel depth, so a greater number of filters, right? And so you can have

00:23:17.676 --> 00:23:21.766
any design that you want so
you don't have to do this.

00:23:21.766 --> 00:23:24.341
In practice you will see this
happen a lot of the times

00:23:24.341 --> 00:23:30.598
and one of the reasons is people try and maintain
kind of a relatively constant level of compute,

00:23:30.598 --> 00:23:37.991
so as you go higher up or deeper into your network,
you're usually also downsampling

00:23:39.606 --> 00:23:45.759
and having a smaller total spatial area, and so then
you also increase the depth a little bit;

00:23:45.759 --> 00:23:53.367
it's not as expensive now to increase the depth because
it's spatially smaller, and so, yeah, that's just one reason.

00:23:53.367 --> 00:23:54.716
Question.

00:23:54.716 --> 00:23:58.133
[student speaks off mic]

00:23:59.872 --> 00:24:04.653
- So, performance-wise, is there any reason to use
an SVM [mumbles] instead of a softmax [mumbles]?

00:24:04.653 --> 00:24:09.761
so no, for a classifier you can use either one,
and you did that earlier in the class as well,

00:24:09.761 --> 00:24:17.242
but in general softmax losses have generally worked
well and been in standard use for classification here.

00:24:18.509 --> 00:24:20.023
Okay yeah one more question.

00:24:20.023 --> 00:24:23.523
[student mumbles off mic]

00:24:37.902 --> 00:24:45.398
- Yes, so the question is, we don't have to store all of the memory
like we can throw away the parts that we don't need and so on?

00:24:45.398 --> 00:24:49.221
And yes this is true.
Some of this you don't need to keep,

00:24:49.221 --> 00:25:02.571
but you're also going to be doing a backwards pass where, for the most part, when you're doing the chain rule
and so on, you need a lot of these activations as part of it, and so in large part a lot of this does need to be kept.

00:25:04.006 --> 00:25:14.440
So if we look at the distribution of where memory is used and where the parameters are,
you can see that a lot of the memory is in these early layers, right, where you still have

00:25:14.440 --> 00:25:24.054
large spatial dimensions, so you're going to have more memory usage, and then a lot of
the parameters are actually in the last layers; the fully connected layers

00:25:24.054 --> 00:25:28.837
have a huge number of parameters right, because
we have all of these dense connections.

00:25:28.837 --> 00:25:36.999
And so that's something just to know and then keep
in mind so later on we'll see some networks actually

00:25:36.999 --> 00:25:42.345
get rid of these fully connected layers and be
able to save a lot on the number of parameters.
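
This skew is easy to verify with two representative VGG-16 layers (sizes assume 224 by 224 inputs, with convs padded to preserve spatial size):

```python
# Activations: an early conv output dwarfs a late one
act_conv1 = 224 * 224 * 64   # first conv block output: ~3.2M numbers
act_conv5 = 14 * 14 * 512    # last conv block output:  ~100K numbers

# Parameters: the first fully connected layer dwarfs the first conv layer
params_conv1 = 3 * 3 * 3 * 64    # ~1.7K weights
params_fc6 = 7 * 7 * 512 * 4096  # ~103M weights
print(act_conv1 // act_conv5, params_fc6 // params_conv1)
```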

00:25:42.345 --> 00:25:48.059
And then just one last thing to point out: you'll also
see different ways of referring to all of these layers, right?

00:25:48.059 --> 00:25:56.190
So here I've written out exactly what the layers are.
conv3-64 means three by three convs with 64 total filters.

00:25:56.190 --> 00:26:05.190
But for VGGNet, on this diagram on the right here, there are also
common ways that people will name each group of layers,

00:26:05.190 --> 00:26:11.822
so each orange block here, as in conv1
part one: conv1-1, conv1-2, and so on.

00:26:11.822 --> 00:26:14.655
So just something to keep in mind.

00:26:16.594 --> 00:26:22.120
So VGGNet ended up getting second place in
the ImageNet 2014 classification challenge,

00:26:22.120 --> 00:26:24.783
first in localization.

00:26:24.783 --> 00:26:29.037
They followed a very similar training
procedure as Alex Krizhevsky for the AlexNet.

00:26:29.037 --> 00:26:38.764
They didn't use local response normalization; as I mentioned earlier,
they found out this didn't really help, and so they took it out.

00:26:38.764 --> 00:26:49.615
You'll see VGG 16 and VGG 19 are the common variants of this architecture,
and this is just the number of layers; 19 is slightly deeper than 16.

00:26:49.615 --> 00:27:00.366
In practice VGG 19 works a very little bit better, and there's a little bit
more memory usage, so you can use either, but 16 is very commonly used.

00:27:01.470 --> 00:27:10.110
For best results, like AlexNet, they did ensembling in order
to average several models, and you get better results.

00:27:10.110 --> 00:27:20.158
And they also showed in their work that the FC7 features, the last
fully connected layer before going to the 1000 ImageNet classes,

00:27:20.158 --> 00:27:26.463
the 4096-size layer just before that,
are a good feature representation,

00:27:26.463 --> 00:27:35.055
that can even just be used as is, to extract these features
from other data, and generalize to these other tasks as well.

00:27:35.055 --> 00:27:37.792
And so FC7 is a good
feature representation.

00:27:37.792 --> 00:27:39.142
Yeah question.

00:27:39.142 --> 00:27:44.432
[student speaks off mic]
- Sorry what was the question?

00:27:45.939 --> 00:27:50.036
Okay, so the question is
what is localization here?

00:27:50.036 --> 00:27:57.163
And so this is a task, and we'll talk about it a little bit more in
a later lecture on detection and localization so I don't want to

00:27:57.163 --> 00:28:03.205
go into detail here, but it's basically, given an image, not
just classifying what's the class of the image,

00:28:03.205 --> 00:28:09.433
but also drawing a bounding box around
where that object is in the image.

00:28:09.433 --> 00:28:16.153
And the difference with detection, which is a very related task, is that in
detection there can be multiple instances of the object in the image;

00:28:16.153 --> 00:28:22.671
in localization we're assuming there's just one, so it's
classification, but we just have this additional bounding box.

00:28:25.343 --> 00:28:32.382
So we looked at VGG which was one of the deep networks
from 2014 and then now we'll talk about GoogleNet

00:28:32.382 --> 00:28:36.603
which was the other one that won
the classification challenge.

00:28:37.612 --> 00:28:47.776
So GoogleNet again was a much deeper network with 22 layers but one of
the main insights and special things about GoogleNet is that it really

00:28:47.776 --> 00:28:57.866
looked at this problem of computational efficiency and it tried to design
a network architecture that was very efficient in the amount of compute.

00:28:57.866 --> 00:29:05.023
And so they did this using this inception module which
we'll go into more detail and basically stacking

00:29:05.023 --> 00:29:08.336
a lot of these inception
modules on top of each other.

00:29:08.336 --> 00:29:19.841
There are also no fully connected layers in this network; they got rid of those and were able to save a lot of
parameters, and so in total there are only five million parameters, which is twelve times fewer than AlexNet,

00:29:19.841 --> 00:29:24.308
which had 60 million, even
though GoogleNet is much deeper.

00:29:24.308 --> 00:29:26.975
It got 6.7% top five error.

00:29:31.392 --> 00:29:35.363
So what's the inception module?
So the idea behind the inception module

00:29:35.363 --> 00:29:40.023
is that they wanted to design
a good local network topology,

00:29:40.023 --> 00:29:52.341
and it has this idea of a local topology that you can think of as a network
within a network, and then they stack a lot of these local topologies on top of each other.

00:29:52.341 --> 00:29:58.387
And so in this local network that they're calling an
inception module what they're doing is they're basically

00:29:58.387 --> 00:30:07.138
applying several different kinds of filter operations in
parallel on top of the same input coming into this same layer.

00:30:07.138 --> 00:30:11.896
So we have our input coming in from the previous layer and
then we're going to do different kinds of convolutions.

00:30:11.896 --> 00:30:25.647
So a one by one conv, right a three by three conv, five by five conv, and then they also have a pooling operation
in this case three by three pooling, and so you get all of these different outputs from these different layers,

00:30:25.647 --> 00:30:31.499
and then what they do is they concatenate all
these filter outputs together depth wise, and so

00:30:31.499 --> 00:30:38.893
then this creates one tensor output at the end
that is going to pass on to the next layer.

00:30:41.020 --> 00:30:50.015
So if we look at just a naive way of doing this we just do exactly that we have all
of these different operations we get the outputs we concatenate them together.

00:30:50.015 --> 00:30:52.386
So what's the problem with this?

00:30:52.386 --> 00:30:57.717
And it turns out that computational
complexity is going to be a problem here.

00:30:58.982 --> 00:31:11.156
So if we look more carefully at an example, so here just as an example I've put a one by
one conv with 128 filters, a three by three conv with 192 filters, and a five by five conv with 96 filters.

00:31:11.156 --> 00:31:19.398
Assume everything has basically the stride that's going to maintain
the spatial dimensions, and that we have this input coming in.

00:31:21.341 --> 00:31:29.231
So what is the output size of the one by one
conv with 128 filters? Who has a guess?

00:31:35.910 --> 00:31:39.910
OK so I heard 28 by 28,
by 128 which is correct.

00:31:40.988 --> 00:31:53.159
So right, with a one by one conv we're going to maintain spatial dimensions, and then on top
of that, each conv filter is going to look through the entire 256 depth of the input,

00:31:53.159 --> 00:32:00.194
but then the output is going to be, we have a 28 by 28 feature
map for each of the 128 filters that we have in this conv layer.

00:32:00.194 --> 00:32:02.361
So we get 28 by 28 by 128.

00:32:05.469 --> 00:32:14.939
OK and then now if we do the same thing and we look at the
output sizes, sorry, of all of the different filters here, after the

00:32:14.939 --> 00:32:20.379
three by three conv we're going to have this volume
of 28 by 28 by 192, and after the five by five conv

00:32:20.379 --> 00:32:24.559
we have 96 filters here.
So 28 by 28 by 96,

00:32:24.559 --> 00:32:34.712
and then our pooling layer is just going to keep the same depth
here, so the pooling layer will preserve the full 256 input depth,

00:32:34.712 --> 00:32:40.192
and here because of our stride, we're also
going to preserve our spatial dimensions.

00:32:41.225 --> 00:32:51.498
And so now if we look at the output size after filter concatenation, what we're
going to get is 28 by 28, these are all 28 by 28, and we're concatenating depth-wise.

00:32:51.498 --> 00:32:59.330
So we get 28 by 28 times all of these added together, and
the total output size is going to be 28 by 28 by 672.

00:33:01.113 --> 00:33:10.208
So the input to our inception module was 28 by 28 by 256,
then the output from this module is 28 by 28 by 672.

00:33:11.466 --> 00:33:17.254
So we kept the same spatial dimensions,
and we blew up the depth.
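To make these shapes concrete, here is a small plain-Python sketch (not from the lecture, just illustrative, using the example's filter counts) of the depth-wise filter concatenation: every branch keeps the 28 by 28 spatial size, so the depths simply add up.

```python
# Each branch of the naive inception module keeps the 28x28 spatial size
# (zero padding / stride 1), so concatenating depth-wise just sums the depths.
def branch_shape(h, w, out_depth):
    return (h, w, out_depth)

H, W = 28, 28
branches = [
    branch_shape(H, W, 128),  # 1x1 conv, 128 filters
    branch_shape(H, W, 192),  # 3x3 conv, 192 filters
    branch_shape(H, W, 96),   # 5x5 conv, 96 filters
    branch_shape(H, W, 256),  # 3x3 pool: preserves the input depth of 256
]

out_depth = sum(d for (_, _, d) in branches)
print((H, W, out_depth))  # (28, 28, 672): spatial size kept, depth blown up
```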

00:33:17.254 --> 00:33:18.188
Question.

00:33:18.188 --> 00:33:21.905
[student speaks off mic]

00:33:21.905 --> 00:33:25.546
OK So in this case, yeah, the question is,
how are we getting 28 by 28 for everything?

00:33:25.546 --> 00:33:29.307
So here we're doing all the zero padding in
order to maintain the spatial dimensions,

00:33:29.307 --> 00:33:33.403
and that way we can do this filter
concatenation depth-wise.

00:33:34.395 --> 00:33:36.233
Question in the back.

00:33:36.233 --> 00:33:39.650
[student speaks off mic]

00:33:44.824 --> 00:33:47.805
- OK The question is what's
the 256 depth at the input,

00:33:47.805 --> 00:33:53.814
and so this is not the input to the network, this is the
input just to this local module that I'm looking at.

00:33:53.814 --> 00:34:00.506
So in this case 256 is the depth of the previous
inception module that came just before this.

00:34:00.506 --> 00:34:08.438
And so now coming out we have 28 by 28 by 672, and that's
going to be the input to the next inception module.

00:34:08.438 --> 00:34:09.915
Question.

00:34:09.916 --> 00:34:13.333
[student speaks off mic]

00:34:17.039 --> 00:34:23.181
- Okay the question is, how did we get 28 by
28 by 128 for the first one, the first conv,

00:34:23.181 --> 00:34:34.058
and it's basically a one by one convolution, right, so we're going to take
this one by one convolution and slide it across our 28 by 28 by 256 input spatially,

00:34:35.485 --> 00:34:41.956
where at each location it's going to do a dot product
through the entire 256 depth, and so we do this

00:34:41.956 --> 00:34:46.983
one by one conv slide it over spatially and we
get a feature map out that's 28 by 28 by one.

00:34:46.983 --> 00:34:58.311
There's one number at each spatial location coming out, and each filter produces
one of these 28 by 28 by one maps, and we have here a total 128 filters,

00:35:01.050 --> 00:35:04.800
and that's going to
produce 28 by 28, by 128.

00:35:05.809 --> 00:35:10.403
OK so if you look at the number of operations
that are happening in the convolutional layer,

00:35:10.403 --> 00:35:22.553
let's look at the first one for example, this one by one conv. As I was just
saying, at each location we're doing a one by one by 256 dot product.

00:35:24.545 --> 00:35:28.358
So there's 256 multiply
operations happening here

00:35:28.358 --> 00:35:37.865
and then for each filter map we have 28 by 28 spatial locations, so that's
the 28 times 28, the first two numbers that are multiplied here.

00:35:37.865 --> 00:35:53.859
These are the spatial locations for each filter map, and so we have to do these 256 multiplications at each
one of these; then we have 128 total filters at this layer, or we're producing 128 total feature maps.

00:35:53.859 --> 00:36:01.221
And so the total number of these operations here
is going to be 28 times 28 times 128 times 256.

00:36:02.129 --> 00:36:10.349
And so this is going to be the same for, you can think about this for the three
by three conv, and the five by five conv, that's exactly the same principle.

00:36:10.349 --> 00:36:16.690
And in total we're going to get 854 million
operations that are happening here.
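To check this arithmetic, here is a short plain-Python sketch (illustrative, not from the lecture) that counts the multiply operations for each branch: output positions times number of filters times the receptive-field size times input depth.

```python
# Multiply operations for one conv layer:
# (spatial output positions) x (number of filters) x (kernel area x input depth)
def conv_ops(h, w, num_filters, kernel, in_depth):
    return h * w * num_filters * kernel * kernel * in_depth

H, W, IN_DEPTH = 28, 28, 256

ops_1x1 = conv_ops(H, W, 128, 1, IN_DEPTH)  # 28x28x128 x (1x1x256)
ops_3x3 = conv_ops(H, W, 192, 3, IN_DEPTH)  # 28x28x192 x (3x3x256)
ops_5x5 = conv_ops(H, W, 96, 5, IN_DEPTH)   # 28x28x96  x (5x5x256)

total = ops_1x1 + ops_3x3 + ops_5x5
print(total)  # 854,196,224, i.e. roughly 854 million operations
```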

00:36:17.968 --> 00:36:21.191
- [Student] And the 128,
192, and 96 are just values

00:36:22.131 --> 00:36:29.044
- Question, the 128, 192 and 96 are values that I picked, yes,
but these are not values that I just came up with.

00:36:29.044 --> 00:36:35.594
They are similar to the ones that you will see
in like a particular layer of inception net,

00:36:35.594 --> 00:36:43.103
so in GoogleNet basically, each module has a different set of these
kinds of parameters, and I picked one that was similar to one of these.

00:36:45.089 --> 00:36:49.046
And so these operations are
very expensive computationally.

00:36:49.046 --> 00:36:55.507
And then the other thing that I also want to note is that the pooling layer
also adds to this problem because it preserves the whole feature depth.

00:36:57.062 --> 00:37:03.519
So at every layer your total depth can only grow,
right, because you're going to take the full feature depth

00:37:03.519 --> 00:37:10.513
from your pooling layer, as well as all the additional
feature maps from the conv layers and add these up together.

00:37:10.513 --> 00:37:18.960
So here our input was 256 depth and our output is 672 depth
and you're just going to keep increasing this as you go up.

00:37:21.920 --> 00:37:25.441
So how do we deal with this and how
do we keep this more manageable?

00:37:25.441 --> 00:37:36.181
And so one of the key insights that GoogleNet used was that we can
address this by using bottleneck layers and trying to project these feature maps

00:37:36.181 --> 00:37:43.174
to a lower dimension before our convolutional
operations, so before our expensive layers.

00:37:45.007 --> 00:37:46.642
And so what exactly does that mean?

00:37:46.642 --> 00:37:58.080
So as a reminder, a one by one convolution, I guess we were just going through this, but it's taking your input
volume, it's performing a dot product at each spatial location, and what it does is it preserves the spatial dimension

00:38:00.141 --> 00:38:06.139
but it reduces the depth and it reduces that by
projecting your input depth to a lower dimension.

00:38:06.139 --> 00:38:10.515
It's basically like a linear
combination of your input feature maps.
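One way to see that "linear combination of feature maps" view: a one by one conv with a single filter just takes a weighted sum across the depth at each spatial location. A tiny hand-rolled sketch (illustrative only; made-up 2 by 2 input with depth 3, made-up weights):

```python
# Input volume x[c][i][j]: depth 3, spatial 2x2
x = [
    [[1.0, 2.0], [3.0, 4.0]],   # feature map 0
    [[5.0, 6.0], [7.0, 8.0]],   # feature map 1
    [[9.0, 1.0], [2.0, 3.0]],   # feature map 2
]
w = [0.5, 0.25, 0.25]  # one 1x1 filter: one weight per input feature map

# At every spatial location the output is a weighted (linear) combination
# across the input depth; the spatial dimensions are untouched.
out = [[sum(w[c] * x[c][i][j] for c in range(3)) for j in range(2)]
       for i in range(2)]
print(out)  # one 2x2 output feature map
```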

00:38:12.880 --> 00:38:18.199
And so this main idea is that it's projecting
your depth down and so the inception module

00:38:18.199 --> 00:38:29.085
takes these one by one convs and inserts them at a bunch of places
in these modules, in order to alleviate this expensive compute.

00:38:29.085 --> 00:38:36.162
So before the three by three and five by five conv
layers, it puts in one of these one by one convolutions.

00:38:36.162 --> 00:38:42.315
And then after the pooling layer it also
puts an additional one by one convolution.

00:38:43.284 --> 00:38:47.609
Right so these are the one by one
bottleneck layers that are added in.

00:38:48.562 --> 00:38:52.736
And so how does this change the math
that we were looking at earlier?

00:38:52.736 --> 00:38:58.589
So now basically what's happening is that we
still have the same input here 28 by 28 by 256,

00:38:58.589 --> 00:39:12.856
but these one by one convs are going to reduce the depth dimension and so you can see before the three by
three convs, if I put a one by one conv with 64 filters, my output from that is going to be, 28 by 28 by 64.

00:39:14.184 --> 00:39:25.154
So now going into the three by three convs afterwards, instead of
28 by 28 by 256 coming in, we only have a 28 by 28 by 64 block coming in.

00:39:25.154 --> 00:39:31.454
And so this now gives a smaller input
going into these conv layers, and the same thing for

00:39:31.454 --> 00:39:40.499
the five by five conv, and then for the pooling layer, after the
pooling comes out, we're going to reduce the depth after this.

00:39:41.562 --> 00:39:51.214
And so, if you work out the math the same way for all of the convolutional ops here,
adding in now all these one by one convs on top of the three by threes and five by fives,

00:39:51.214 --> 00:40:02.499
the total number of operations is 358 million operations, so it's much less than
the 854 million that we had in the naive version, and so you can see how you

00:40:02.499 --> 00:40:10.438
can use this one by one conv, and the filter
size for that to control your computation.
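Repeating the operation count under one assumed configuration (64-filter one by one bottlenecks before the 3x3 and 5x5 convs and after the pool, with the same 128/192/96 branch filters as before; the lecture's exact configuration, which yields 358 million, may differ slightly) shows the saving:

```python
def conv_ops(h, w, num_filters, kernel, in_depth):
    # multiplies: (output positions) x (filters) x (kernel area x input depth)
    return h * w * num_filters * kernel * kernel * in_depth

H, W, IN_DEPTH = 28, 28, 256
BOTTLENECK = 64  # assumed 1x1 bottleneck width

naive = (conv_ops(H, W, 128, 1, IN_DEPTH)
         + conv_ops(H, W, 192, 3, IN_DEPTH)
         + conv_ops(H, W, 96, 5, IN_DEPTH))

with_bottlenecks = (
    conv_ops(H, W, 128, 1, IN_DEPTH)           # 1x1 branch, unchanged
    + conv_ops(H, W, BOTTLENECK, 1, IN_DEPTH)  # 1x1 before the 3x3
    + conv_ops(H, W, 192, 3, BOTTLENECK)       # 3x3 now sees only 64 deep
    + conv_ops(H, W, BOTTLENECK, 1, IN_DEPTH)  # 1x1 before the 5x5
    + conv_ops(H, W, 96, 5, BOTTLENECK)        # 5x5 now sees only 64 deep
    + conv_ops(H, W, BOTTLENECK, 1, IN_DEPTH)  # 1x1 after the pool
)

print(naive, with_bottlenecks)  # the bottleneck version is several times cheaper
```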

00:40:10.438 --> 00:40:12.118
Yes, question in the back.

00:40:12.118 --> 00:40:15.535
[student speaks off mic]

00:40:23.525 --> 00:40:30.979
- Yes, so the question is, have you looked into what information
might be lost by doing this one by one conv at the beginning.

00:40:30.979 --> 00:40:35.112
And so there might be
some information loss,

00:40:35.112 --> 00:40:46.013
but at the same time if you're doing these projections you're taking a linear combination of
these input feature maps which has redundancy in them, you're taking combinations of them,

00:40:47.623 --> 00:40:59.422
and you're also introducing an additional non-linearity after the one by one conv, so it also actually
helps in that way with adding a little bit more depth and so, I don't think there's a rigorous analysis

00:40:59.422 --> 00:41:07.314
of this, but basically in general this works
better and there's reasons why it helps as well.

00:41:07.314 --> 00:41:15.627
OK so here we have, we're basically using these one by
one convs to help manage our computational complexity,

00:41:15.627 --> 00:41:20.450
and then what GoogleNet does is it takes these inception
modules and it's going to stack all these together.

00:41:20.450 --> 00:41:22.827
So this is a full inception architecture.

00:41:22.827 --> 00:41:32.773
And if we look at this a little bit more detail, so here I've flipped it,
because it's so big, it's not going to fit vertically any more on the slide.

00:41:32.773 --> 00:41:41.867
So what we start with is we first have this stem network, so this is more the kind
of vanilla, plain conv net that we've seen earlier, a sequence of layers.

00:41:43.256 --> 00:41:48.570
So conv pool a couple of convs in another
pool just to get started and then after that

00:41:48.570 --> 00:41:54.911
we have all of our different our multiple inception
modules all stacked on top of each other,

00:41:54.911 --> 00:41:58.433
and then on top we have
our classifier output.

00:41:58.433 --> 00:42:08.982
And notice here that they've really removed the expensive fully connected layers; it turns
out that the model works great without them, and you reduce a lot of parameters.

00:42:08.982 --> 00:42:17.098
And then what they also have here is, you can see these couple of
extra stems coming out and these are auxiliary classification outputs

00:42:18.866 --> 00:42:23.273
and so these are also you know
just a little mini networks

00:42:23.273 --> 00:42:29.217
with an average pooling, a one by one conv, a
couple of fully connected layers here going to

00:42:29.217 --> 00:42:35.702
the SoftMax, which is also a 1000 way
SoftMax with the ImageNet classes.

00:42:35.702 --> 00:42:41.350
And so you're actually using your ImageNet training
classification loss in three separate places here.

00:42:41.350 --> 00:42:51.752
The standard end of the network, as well as in these two places earlier on
in the network, and the reason they do that is just this is a deep network

00:42:51.752 --> 00:43:02.140
and they found that having these additional auxiliary classification
outputs, you get more gradient signal injected at the earlier layers,

00:43:02.140 --> 00:43:13.484
and so there's more helpful signal flowing in, because these intermediate layers should
also be helpful; you should be able to do classification based off some of these as well.

00:43:13.484 --> 00:43:20.711
And so this is the full architecture,
there's 22 total layers with weights and so

00:43:20.711 --> 00:43:29.474
within each of these modules each of those one by one, three by three, five
by five is a weight layer, just including all of these parallel layers,

00:43:29.474 --> 00:43:44.128
and in general it's a relatively more carefully designed architecture, and part of
this is based on some of these intuitions that we were talking about, and part of it

00:43:44.128 --> 00:43:55.511
is just that, you know, at Google the authors had huge clusters and they were cross
validating across all kinds of design choices, and this is what ended up working well.

00:43:55.511 --> 00:43:57.105
Question?

00:43:57.105 --> 00:44:00.522
[student speaks off mic]

00:44:24.442 --> 00:44:32.457
- Yeah so the question is, are the auxiliary outputs actually
useful for the final classification, to use these as well?

00:44:32.457 --> 00:44:39.164
I think when they're training them they do average all
these for the losses coming out. I think they are helpful.

00:44:39.164 --> 00:44:49.272
I can't remember if in the final architecture, whether they average all of these or just take
one, it seems very possible that they would use all of them, but you'll need to check on that.

00:44:49.272 --> 00:44:52.689
[student speaks off mic]

00:44:58.352 --> 00:45:10.219
- So the question is for the bottleneck layers, is it possible to use some other types
of dimensionality reduction and yes you can use other kinds of dimensionality reduction.

00:45:10.219 --> 00:45:17.138
The benefit here of this one by one conv is that you're getting
this effect, but it's a conv layer just like any other.

00:45:17.138 --> 00:45:26.180
It's part of the whole network; you just train this full network with backprop
through everything, and it's learning how to combine the previous feature maps.

00:45:28.601 --> 00:45:30.730
Okay yeah, question in the back.

00:45:30.730 --> 00:45:34.147
[student speaks off mic]

00:45:35.807 --> 00:45:42.549
- Yes so, the question is are any weights
shared or are they all separate, and yeah,

00:45:42.549 --> 00:45:45.542
all of these layers have separate weights.

00:45:45.542 --> 00:45:46.690
Question.

00:45:46.690 --> 00:45:50.107
[student speaks off mic]

00:45:56.784 --> 00:46:00.143
- Yes so the question is why do we have
to inject gradients at earlier layers?

00:46:00.143 --> 00:46:07.785
So our classification output at the very end, where we get a gradient
on this, it's passed all the way back through the chain rule,

00:46:09.599 --> 00:46:21.178
but the problem is when you have very deep networks and you're going all the way back through these, some
of this gradient signal can become diminished and lost closer to the beginning, and so that's why having

00:46:21.178 --> 00:46:28.377
these additional ones in earlier parts
can help provide some additional signal.

00:46:28.377 --> 00:46:32.667
[student mumbles off mic]

00:46:32.667 --> 00:46:35.853
- So the question is are you doing back
prop multiple times, once for each output.

00:46:35.853 --> 00:46:41.446
No it's just one back prop all the way
through, and you can think of these three,

00:46:41.446 --> 00:46:48.075
you can think of there being kind of like an addition at the end of these
if you were to draw up your computational graph, and so you get your

00:46:48.075 --> 00:46:54.004
final signal and you can just take all of these
gradients and just backprop them all the way through.

00:46:54.004 --> 00:46:58.970
So it's as if they were added together
at the end in a computational graph.

00:46:58.970 --> 00:47:05.423
OK so in the interest of time, because we still have a lot
to get through, we can take other questions offline.

00:47:07.353 --> 00:47:10.520
Okay so GoogleNet basically 22 layers.

00:47:11.441 --> 00:47:15.983
It has an efficient inception module,
there's no fully connected layers.

00:47:15.983 --> 00:47:22.026
12 times fewer parameters than AlexNet, and
it's the ILSVRC 2014 classification winner.

00:47:25.228 --> 00:47:30.869
And so now let's look at the 2015 winner,
which is the ResNet network and so here

00:47:30.869 --> 00:47:38.339
this idea is really this revolution of depth, right.
We were starting to increase depth in 2014, and here we've

00:47:38.339 --> 00:47:45.616
just had this hugely deeper model, the ResNet
architecture, at 152 layers.

00:47:45.616 --> 00:47:48.846
And so now let's look at that
in a little bit more detail.

00:47:48.846 --> 00:47:54.286
So the ResNet architecture is getting extremely deep,
much deeper than any other network

00:47:54.286 --> 00:48:00.479
before and it's doing this using this idea of
residual connections which we'll talk about.

00:48:00.479 --> 00:48:04.158
And so, they had 152
layer model for ImageNet.

00:48:04.158 --> 00:48:07.969
They were able to get 3.57%
top 5 error with this,

00:48:07.969 --> 00:48:18.114
and the really special thing is that they swept all classification and detection
contests in the ImageNet benchmark and this other benchmark called COCO.

00:48:18.114 --> 00:48:23.546
It just basically won everything. So it was
just clearly better than everything else.

00:48:25.055 --> 00:48:32.538
And so now let's go into a little bit of the motivation
behind ResNet and residual connections that we'll talk about.

00:48:32.538 --> 00:48:41.939
And the question that they started off by trying to answer is what happens when we
try and stack deeper and deeper layers on a plain convolutional neural network?

00:48:41.939 --> 00:48:53.874
So if we take something like VGG or some normal network that's just stacks of conv and pool layers
on top of each other can we just continuously extend these, get deeper layers and just do better?

00:48:55.601 --> 00:48:58.421
And the answer is no.

00:48:58.421 --> 00:49:06.599
So if you look at what happens when you get deeper, so here I'm
comparing a 20 layer network and a 56 layer network, and so this is just a plain

00:49:09.498 --> 00:49:16.817
kind of network, you'll see that in the test error here on the right
the 56 layer network is doing worse than the 20 layer network.

00:49:16.817 --> 00:49:19.771
So the deeper network was
not able to do better.

00:49:19.771 --> 00:49:29.680
But then the really weird thing is now if you look at the training error
right we here have again the 20 layer network and a 56 layer network.

00:49:29.680 --> 00:49:40.271
With the 56 layer network, one of the obvious explanations you might think of is: I have a really
deep network, I have tons of parameters, maybe it's starting to overfit at some point.

00:49:41.294 --> 00:49:48.985
But what actually happens is that when you're over fitting you would expect
to have very good, very low training error rate, and just bad test error,

00:49:48.985 --> 00:49:55.511
but what's happening here is that in the training error the 56
layer network is also doing worse than the 20 layer network.

00:49:56.833 --> 00:50:01.545
And so even though the deeper model performs
worse, this is not caused by over-fitting.

00:50:03.462 --> 00:50:10.253
And so the hypothesis of the ResNet creators is that
the problem is actually an optimization problem.

00:50:10.253 --> 00:50:15.611
Deeper models are just harder to optimize
than shallower networks.

00:50:16.835 --> 00:50:23.263
And the reasoning was that well, a deeper model should be
able to perform at least as well as a shallower model.

00:50:23.263 --> 00:50:32.330
You can have actually a solution by construction where you just take the learned layers
from your shallower model, you just copy these over and then for the remaining additional

00:50:32.330 --> 00:50:35.192
deeper layers you just
add identity mappings.

00:50:35.192 --> 00:50:39.533
So by construction this should work
just as well as the shallower network.

00:50:39.533 --> 00:50:46.295
And so a deeper model, even if it weren't able to learn anything
better, should be able to learn at least this.

00:50:46.295 --> 00:51:00.594
And so motivated by this their solution was well how can we make it easier for our
architecture, our model to learn these kinds of solutions, or at least something like this?

00:51:00.594 --> 00:51:11.794
And so their idea is well instead of just stacking all these layers on top
of each other and having every layer try and learn some underlying mapping

00:51:11.794 --> 00:51:21.708
of a desired function, lets instead have these blocks, where we
try and fit a residual mapping, instead of a direct mapping.

00:51:21.708 --> 00:51:28.220
And so what this looks like is here on the right, where
the input to this block is just the input coming in,

00:51:29.818 --> 00:51:48.499
and here, on this side, we're going to use our layers to try and fit
some residual, H of X minus X, instead of the desired function H of X directly.

00:51:49.450 --> 00:51:55.827
And so basically at the end of this block we have
the skip connection on the right here, this loop,

00:51:55.827 --> 00:52:07.241
where we just take our input and pass it through as an identity, and so if we had no weight layers
in between, the output would just be the same thing as the input, but now we use

00:52:07.241 --> 00:52:12.562
our additional weight layers to learn
some delta, for some residual from our X.

00:52:14.067 --> 00:52:24.502
And so now the output of this is going to be just our original X plus some
residual, which we're going to call F of X. It's basically a delta, and so the idea is that

00:52:24.502 --> 00:52:31.428
now it should be easy, for example
in the case where the identity is ideal,

00:52:32.510 --> 00:52:39.249
to just squash all of these weights of F of X,
from our weight layers, to all zero,

00:52:39.249 --> 00:52:48.578
for example, then we're just going to get identity as the output, and we can get
something, for example, close to this solution by construction that we had earlier.

00:52:48.578 --> 00:53:00.962
Right, so this is just a network architecture that says okay, let's have our weight
layers learn the residual; that way the output will more likely be something close to X,

00:53:00.962 --> 00:53:05.388
just modifying X, than if it had to learn exactly
this full mapping of what it should be.

00:53:05.388 --> 00:53:08.249
Okay, any questions about this?

00:53:08.249 --> 00:53:09.189
[student speaks off mic]

00:53:09.189 --> 00:53:12.689
- The question is, is it the same dimension?

00:53:13.770 --> 00:53:17.603
So yes these two paths
are the same dimension.

00:53:18.752 --> 00:53:32.288
In general either it's the same dimension, or what they actually do is they have these projection shortcuts,
and they have different ways of padding, to make things work out to be the same dimension depth-wise.

00:53:32.288 --> 00:53:33.395
Yes

00:53:33.395 --> 00:53:39.120
- [Student] When you use the word residual
you were talking about [mumbles off mic]

00:53:45.857 --> 00:53:53.638
- So the question is what exactly do we mean by residual,
is the output of this transformation the residual?

00:53:53.638 --> 00:54:01.899
So we can think of our output here right as this F of X
plus X, where F of X is the output of our transformation

00:54:01.899 --> 00:54:06.650
and then X is our input, just
passed through by the identity.

00:54:06.650 --> 00:54:17.198
So with a plain layer, what we're trying to do is learn something
like H of X, but what we saw earlier is that it's hard to learn

00:54:17.198 --> 00:54:20.671
a good H of X as we
get very deep networks.

00:54:20.671 --> 00:54:29.438
And so here the idea is, let's instead break it down as H of X
is equal to F of X plus X, and let's just try and learn F of X.

00:54:29.438 --> 00:54:39.741
And so instead of learning directly this H of X we just want to learn what is it
that we need to add or subtract to our input as we move on to the next layer.

00:54:39.741 --> 00:54:45.889
So you can think of it as kind of modifying
this input, in place in a sense. We have--

00:54:45.889 --> 00:54:49.121
[interrupted by student mumbling off mic]

00:54:49.121 --> 00:54:58.129
- The question is, when we're saying the word residual are we talking about F of X?
Yeah. So F of X is what we're calling the residual. And it just has that meaning.
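The block being described can be sketched in PyTorch. This is a minimal illustrative version, not the paper's exact block (the real ResNet blocks also use batch normalization): two 3x3 convs learn the residual F(x), and the skip connection adds x back.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Minimal sketch of a basic residual block: two 3x3 convs compute the
    # residual F(x); the skip connection makes the output F(x) + x.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # the residual F(x)
        return self.relu(f + x)                   # F(x) + x

x = torch.randn(1, 64, 28, 28)
block = ResidualBlock(64)
out = block(x)
print(out.shape)  # same shape as the input: torch.Size([1, 64, 28, 28])

# If the residual weights are squashed to zero, the block reduces to
# (a ReLU of) the identity, which is the intuition from the lecture.
with torch.no_grad():
    for p in block.parameters():
        p.zero_()
```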

00:55:01.477 --> 00:55:03.941
Yes another question.

00:55:03.941 --> 00:55:07.441
[student mumbles off mic]

00:55:11.319 --> 00:55:20.145
- So the question is, in practice do we just sum F of X and X together,
or do we learn some weighted combination? And you just do a direct sum.

00:55:20.145 --> 00:55:28.809
Because when you do a direct sum, this is the idea of let
me just learn what is it I have to add or subtract onto X.

00:55:30.652 --> 00:55:34.463
Is this clear to everybody,
the main intuition?

00:55:34.463 --> 00:55:35.361
Question.

00:55:35.361 --> 00:55:38.778
[student speaks off mic]

00:55:40.721 --> 00:55:47.099
- Yeah, so the question is, it's not clear why learning the
residual should be easier than learning the direct mapping?

00:55:47.099 --> 00:55:58.747
And so this is just their hypothesis, and the hypothesis is that if we're
learning the residual you just have to learn what's the delta to X, right?

00:55:58.747 --> 00:56:16.101
And our hypothesis is that, generally, even something like our solution by construction, where we had some number of
these shallower layers that were learned and then all these identity mappings on top, was a solution that should have been

00:56:16.101 --> 00:56:23.985
good, and so that implies that for maybe a lot of these layers,
something just close to identity would be a good layer.

00:56:23.985 --> 00:56:30.954
And so because of that, now we formulate this as being
able to learn the identity plus just a little delta.

00:56:30.954 --> 00:56:34.315
And if really the identity
is best, we can just make

00:56:34.315 --> 00:56:40.363
the F of X transformation be zero, which is
something that might seem relatively easier to learn,

00:56:40.363 --> 00:56:44.784
also we're able to get things that
are close to identity mappings.

00:56:44.784 --> 00:56:50.966
And so again this is not something that's necessarily
proven; it's just the intuition and hypothesis,

00:56:50.966 --> 00:56:58.708
and then we'll also see later some works where people are actually trying to challenge
this and say oh maybe it's not actually the residuals that are so necessary,

00:56:58.708 --> 00:57:07.507
but at least this is the hypothesis for this paper, and in
practice using this model, it was able to do very well.

00:57:07.507 --> 00:57:08.810
Question.

00:57:08.810 --> 00:57:12.227
[student speaks off mic]

00:57:41.813 --> 00:57:49.128
- Yes so the question is have people tried other ways
of combining the inputs from previous layers and yes

00:57:49.128 --> 00:57:56.747
so this is basically a very active area of research on and how we formulate all
these connections, and what's connected to what in all of these structures.

00:57:56.747 --> 00:58:04.695
So we'll see a few more examples of different network architectures
briefly later but this is an active area of research.

00:58:05.658 --> 00:58:12.093
OK so we basically have all of these residual
blocks that are stacked on top of each other.

00:58:12.093 --> 00:58:14.788
We can see the full ResNet architecture.

00:58:14.788 --> 00:58:27.299
Each of these residual blocks has two three by three conv layers as part of the block, and
there's also been work showing that this happens to be a good configuration that works well.

00:58:27.299 --> 00:58:29.828
We stack all these blocks
together very deeply.

00:58:29.828 --> 00:58:40.851
Another thing with this very deep architecture is that it enables
networks up to 150 layers deep, and then what we do is we stack

00:58:46.582 --> 00:58:53.982
all these, and periodically we also double the number of filters
and downsample spatially using stride two when we do that.

00:58:55.856 --> 00:59:03.867
And then we have this additional conv layer at the very beginning of our
network, and at the end we don't have any fully connected layers,

00:59:03.867 --> 00:59:08.641
and we just have a global average pooling layer
that's going to average over everything spatially,

00:59:08.641 --> 00:59:12.808
and then be input into the
last 1000-way classification layer.
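
As a small sketch of that last stage (the 7x7x512 shape here is illustrative, not ResNet's exact dimensions), global average pooling collapses each feature map to a single number per channel before the classifier:

```python
import numpy as np

# Global average pooling: average each feature map over all spatial
# positions, leaving one value per channel for the final classifier.
features = np.random.rand(7, 7, 512)  # (H, W, C) output of the last conv stage
pooled = features.mean(axis=(0, 1))   # average over H and W -> shape (512,)
print(pooled.shape)  # (512,)
```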

00:59:14.694 --> 00:59:16.991
So this is the full ResNet architecture

00:59:16.991 --> 00:59:21.935
and it's very simple and elegant just stacking up
all of these ResNet blocks on top of each other,

00:59:21.935 --> 00:59:29.389
and they have total depths of up to 34, 50,
101, and they tried up to 152 layers for ImageNet.

00:59:34.230 --> 00:59:43.964
OK so one additional thing just to know is that for a very deep network, so
the ones that are more than 50 layers deep, they also use bottleneck layers

00:59:43.964 --> 00:59:46.663
similar to what GoogleNet did
in order to improve efficiency

00:59:46.663 --> 00:59:57.195
and so within each block, what they did is have this
one by one conv filter that first projects the input down to a smaller depth.

00:59:57.195 --> 01:00:07.949
So again, if we are looking at, let's say, a 28 by 28 by 256 input, we do this one
by one conv, which projects the depth down, and we get 28 by 28 by 64.

01:00:09.107 --> 01:00:18.486
Now your three by three conv, and here they only have one,
is operating over this reduced depth, so it's going to be less expensive,

01:00:18.486 --> 01:00:29.870
and then afterwards they have another one by one conv that projects the depth back
up to 256, and so, this is the actual block that you'll see in deeper networks.
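
A quick back-of-the-envelope check of why the bottleneck helps, counting weights per conv and ignoring biases (a sketch of the arithmetic, not the paper's exact accounting):

```python
# Weights in a conv layer: kernel_size^2 x input_depth x output_depth.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

# One plain 3x3 conv at the full 256 depth:
plain = conv_params(3, 256, 256)
# The bottleneck: 1x1 down to 64, 3x3 at the reduced depth, 1x1 back to 256.
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))
print(plain, bottleneck)  # 589824 69632: the whole bottleneck is ~8.5x cheaper
```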

01:00:33.021 --> 01:00:41.282
So in practice, ResNet also uses batch normalization
after every conv layer, and they use Xavier initialization

01:00:41.282 --> 01:00:50.578
with an extra scaling factor that they introduced to improve
the initialization, and they trained with SGD plus momentum.

01:00:51.604 --> 01:00:59.470
For their learning rate, they use a similar type of schedule,
where you decay your learning rate when your validation error plateaus.

01:01:01.751 --> 01:01:05.874
Mini-batch size 256, a little bit
of weight decay, and no dropout.

01:01:07.645 --> 01:01:13.581
And so experimentally they were able to show that they were
able to train these very deep networks, without degrading.

01:01:13.581 --> 01:01:19.060
They were able to have basically good gradient flow
coming all the way back down through the network.

01:01:19.060 --> 01:01:22.625
They tried up to 152 layers on ImageNet,

01:01:22.625 --> 01:01:26.632
1200 layers on CIFAR, which is a
smaller data set you have played with,

01:01:26.632 --> 01:01:35.024
and they also saw that your deeper networks are
able to achieve lower training errors, as expected.

01:01:36.303 --> 01:01:44.543
So you don't have the same strange plots that we saw
earlier where the behavior was in the wrong direction.

01:01:44.543 --> 01:01:54.843
And so from here they were able to sweep first place in all of the ILSVRC
competitions, and all of the COCO competitions, in 2015 by significant margins.

01:01:56.152 --> 01:02:06.649
Their total top five error was 3.6% for classification, and this
is actually better than the human performance measured on ImageNet.

01:02:08.902 --> 01:02:22.551
There was also a human metric that came from our lab: Andrej Karpathy
spent like a week training himself and then basically did this task himself

01:02:24.730 --> 01:02:34.191
and was, I think, somewhere around 5-ish percent, and so ResNet
was basically able to do better than that human, at least.

01:02:36.175 --> 01:02:42.069
Okay, so these are kind of the main
networks that have been used recently.

01:02:42.069 --> 01:02:48.004
We had AlexNet starting things off;
VGG and GoogleNet are still very popular,

01:02:48.004 --> 01:02:58.218
but ResNet is the most recent, best performing model, so if you're
training a new network, ResNet is available and you should try working with it.

01:03:00.154 --> 01:03:06.403
So let's look quickly at some comparisons to
get a better sense of the complexity involved.

01:03:06.403 --> 01:03:14.120
So here we have some plots that are sorted by performance
so this is top one accuracy here, and higher is better.

01:03:15.275 --> 01:03:21.540
And so you'll see a lot of these models that we talked about, as well
as some different versions of them so, this GoogleNet inception thing,

01:03:21.540 --> 01:03:31.389
I think there's like V2, V3 and the best one here is V4, which is
actually a ResNet plus inception combination, so these are just kind of

01:03:31.389 --> 01:03:39.159
more incremental, smaller changes that they've built on top
of them, and so that's the best performing model here.

01:03:39.159 --> 01:03:45.446
And if we look on the right, these are plots
of their computational complexity, sorted here.

01:03:47.686 --> 01:03:52.313
The Y axis is your top one accuracy
so higher is better.

01:03:52.313 --> 01:04:03.074
The X axis is your operations, so the more to the right, the more ops you're doing and the more
computationally expensive it is, and the size of the circle is your memory usage,

01:04:03.074 --> 01:04:07.251
so the gray circles are references here, but
the bigger the circle, the more memory usage,

01:04:07.251 --> 01:04:16.206
and so here we can see that VGG these green ones are kind of the
least efficient. They have the biggest memory, the most operations,

01:04:16.206 --> 01:04:18.623
but they do pretty well.

01:04:19.838 --> 01:04:29.275
GoogleNet is the most efficient here. It's way down on the
operation side, as well as a small little circle for memory usage.

01:04:29.275 --> 01:04:39.411
AlexNet, our earlier model, has lowest accuracy. It's relatively smaller compute,
because it's a smaller network, but it's also not particularly memory efficient.

01:04:41.309 --> 01:04:46.216
And then ResNet here, we
have moderate efficiency.

01:04:46.216 --> 01:04:52.500
It's kind of in the middle, both in terms of memory
and operations, and it has the highest accuracy.

01:04:56.029 --> 01:04:58.028
And so here also are
some additional plots.

01:04:58.028 --> 01:05:14.868
You can look at these more on your own time, but the plot on the left is showing the forward pass time, in milliseconds, and you can see
up at the top that a VGG forward pass is about 200 milliseconds, so you can get about five frames per second with it, and this is sorted in order.

01:05:14.868 --> 01:05:25.883
There's also this plot on the right looking at power consumption and if you look more at
this paper here, there's further analysis of these kinds of computational comparisons.

01:05:30.604 --> 01:05:38.750
So these were the main architectures that you should really know
in-depth and be familiar with, and be thinking about actively using.

01:05:38.750 --> 01:05:48.263
But now I'm going just to go briefly through some other architectures that are
just good to know either historical inspirations or more recent areas of research.

01:05:50.716 --> 01:05:56.342
So the first one is Network in Network, this
is from 2014, and the idea behind this

01:06:00.529 --> 01:06:16.118
is that we have these vanilla convolutional layers, but this introduces the idea of what they call MLP conv
layers, which are micro networks, basically a network within a network, hence the name of the paper,

01:06:16.118 --> 01:06:23.152
where within each conv layer they try to stack an MLP
with a couple of fully connected layers on top of

01:06:23.152 --> 01:06:29.167
just the standard conv and be able to compute more
abstract features for these local patches right.

01:06:29.167 --> 01:06:41.975
So instead of sliding just a conv filter around, it's sliding a slightly more complex
hierarchical set of filters around and using that to get the activation maps.

01:06:41.975 --> 01:06:47.941
And so, it uses these fully connected, or
basically one by one conv kind of layers.

01:06:47.941 --> 01:06:57.196
It's going to stack them all up like the bottom diagram here where we
just have these networks within networks stacked in each of the layers.

01:06:57.196 --> 01:07:10.102
And the main reason to know this is just that it was kind of a precursor to GoogleNet and ResNet
in 2014, with this idea of bottleneck layers that you saw used very heavily in those.

01:07:10.102 --> 01:07:22.070
And it also provided a little bit of philosophical inspiration for GoogleNet, with this idea of a local
network topology, a network within a network, that they also used, with a different kind of structure.

01:07:24.238 --> 01:07:36.759
Now I'm going to talk about a series of works since ResNet that are mostly geared
towards improving ResNet, and so this is more recent research that's been done since then.

01:07:36.759 --> 01:07:39.911
I'm going to go over these pretty fast,
and so just at a very high level.

01:07:39.911 --> 01:07:44.754
If you're interested in any of these you should
look at the papers, to have more details.

01:07:45.755 --> 01:07:55.719
So the authors of ResNet a little bit later on in 2016 also
had this paper where they improved the ResNet block design.

01:07:56.742 --> 01:08:03.015
And so they basically adjusted what were the
layers that were in the ResNet block path,

01:08:03.015 --> 01:08:18.861
and showed that this new structure creates a more direct path for propagating information throughout the
network; you want to have a good path to propagate information all the way up, and then gradients back down again.

01:08:18.861 --> 01:08:25.319
And so they showed that this new block was better
for that and was able to give better performance.

01:08:25.319 --> 01:08:28.959
There's also Wide Residual
Networks; this paper

01:08:28.959 --> 01:08:40.228
argued that while ResNets made networks much deeper as well as added these
residual connections, the residuals are really the important factor.

01:08:40.228 --> 01:08:45.290
Having this residual construction, and not
necessarily having extremely deep networks.

01:08:45.290 --> 01:08:52.794
And so what they did was they used wider residual blocks, and
so what this means is just more filters in every conv layer.

01:08:52.794 --> 01:09:02.661
So before we might have F filters per layer, and they use this factor
K and say every layer is going to have F times K filters instead.

01:09:02.663 --> 01:09:11.502
And so, using these wider layers they showed that their 50 layer
wide ResNet was able to out-perform the 152 layer original ResNet,

01:09:13.754 --> 01:09:23.035
and it also had the additional advantage that, even
with the same number of parameters, it's more computationally efficient

01:09:23.035 --> 01:09:26.922
because you can parallelize these
wide operations more easily.

01:09:26.923 --> 01:09:39.546
Right, convolutions with more filters are spread across more kernels in parallel, as opposed to
depth, which is more sequential, so it's more computationally efficient to increase your width.
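
As a rough illustration of that width trade-off (the filter counts here are illustrative, not the paper's): widening a layer by a factor K grows its weights by roughly K squared, but that extra work sits in parallel within one layer rather than sequentially across layers.

```python
# Weights in a 3x3 conv with c_in input channels and c_out output filters.
def conv_params(c_in, c_out):
    return 3 * 3 * c_in * c_out

F, K = 64, 2                      # F filters per layer, widening factor K
base = conv_params(F, F)          # original layer
wide = conv_params(F * K, F * K)  # widened layer: both depths scale by K
print(wide // base)  # 4: doubling the width quadruples the weights
```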

01:09:39.546 --> 01:09:49.817
So here you can see this work starting to try to understand the contributions of width and depth and residual connections,
and making some arguments for one way versus the other.

01:09:49.817 --> 01:09:58.125
And this other paper around the same time, I
think maybe a little bit later, is ResNeXt,

01:09:58.125 --> 01:10:04.383
and so this is again, the creators of ResNet
continuing to work on pushing the architecture.

01:10:04.383 --> 01:10:18.576
And here they also had this idea of, okay, let's indeed tackle this width idea more, but instead
of just increasing the width of this residual block through more filters, they add structure.

01:10:18.576 --> 01:10:26.415
And so within each residual block they have multiple parallel pathways, and they
call the total number of these pathways the cardinality.

01:10:26.415 --> 01:10:36.317
And so it's basically taking the one ResNet block with the bottlenecks and
having it be relatively thinner, but having multiple of these done in parallel.

01:10:38.395 --> 01:10:44.452
And so here you can also see that this has
some relation to this idea of wide networks,

01:10:44.452 --> 01:10:54.023
as well as some connection to the inception module, right,
where we have these layers operating in parallel.

01:10:54.023 --> 01:10:58.190
And so now this ResNeXt has
some flavor of that as well.

01:11:00.838 --> 01:11:13.878
So another approach towards improving ResNets was this idea called Stochastic Depth
and in this work the motivation is well let's look more at this depth problem.

01:11:13.878 --> 01:11:21.537
Once you get deeper and deeper, one of the typical problems
that you're going to have is vanishing gradients, right?

01:11:21.537 --> 01:11:32.071
Your gradients will get smaller and eventually vanish as you're trying
to backpropagate them through a very large number of layers.

01:11:32.071 --> 01:11:43.045
And so their motivation is: let's try to have short networks during training,
and they use this idea of dropping out a subset of the layers during training.

01:11:43.045 --> 01:11:48.436
And so for a subset of the layers, they just drop out the
weights and set the block to an identity connection,

01:11:48.436 --> 01:11:56.126
and now what you get is you have these shorter networks
during training, you can pass back your gradients better.

01:11:56.126 --> 01:12:04.074
It's also a little more efficient, and it's kind of like
dropout, right? It has this sort of flavor that you've seen before.

01:12:04.074 --> 01:12:08.108
And then at test time you want to use the
full deep network that you've trained.
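
A minimal sketch of that training rule in pure Python (the toy transform stands in for a residual block's learned layers, which is my simplification): each block is kept with probability p during training and is a pure identity otherwise, while at test time every block is used.

```python
import random

# Stochastic depth: during training a residual block is kept with
# probability p_keep; a dropped block is just the identity connection.
def stochastic_block(x, f, p_keep, training):
    if training and random.random() >= p_keep:
        return x      # block dropped: input passes straight through
    return x + f(x)   # block kept: the usual residual computation

double = lambda x: 2 * x  # toy stand-in for the block's learned layers
# At test time the full deep network is always used.
print(stochastic_block(1.0, double, p_keep=0.5, training=False))  # 3.0
```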

01:12:10.446 --> 01:12:19.038
So these are some of the works looking at the ResNet architecture, trying
to understand different aspects of it and trying to improve ResNet training.

01:12:19.038 --> 01:12:32.253
And so there are also some works now that go beyond ResNet, asking what are some
non-ResNet architectures that can maybe work comparably to or better than ResNets.

01:12:32.253 --> 01:12:45.273
And so one idea is FractalNet, which came out pretty recently, and the argument in FractalNet is that
residual representations are maybe not actually necessary, so this goes back to what we were talking about earlier.

01:12:45.273 --> 01:12:52.645
What's the motivation of residual networks? It seems to make sense, and there are, you
know, good reasons for why this should help, but in this paper

01:12:52.645 --> 01:12:58.407
they're saying that well here is a different architecture
that we're introducing, there's no residual representations.

01:12:58.407 --> 01:13:03.898
We think that the key is more about transitioning
effectively from shallow to deep networks,

01:13:03.898 --> 01:13:13.258
and so they have this fractal architecture which, if you look on the
right here, composes the layers in this fractal fashion.

01:13:14.769 --> 01:13:18.639
And so there's both shallow and
deep pathways to your output.

01:13:20.045 --> 01:13:29.568
And so they have these different length pathways, they train them with
dropping out sub paths, and so again it has this dropout kind of flavor,

01:13:29.568 --> 01:13:37.203
and then at test time they'll use the entire fractal network
and they show that this was able to get very good performance.

01:13:39.047 --> 01:13:44.886
There's another idea called Densely Connected
Convolutional Networks, DenseNet, and this idea

01:13:44.886 --> 01:13:48.567
is now we have these blocks
that are called dense blocks.

01:13:48.567 --> 01:13:55.940
And within each block each layer is going to be connected to
every other layer after it, in this feed forward fashion.

01:13:55.940 --> 01:14:00.362
So within this block, your input to the block
is also the input to every other conv layer,

01:14:00.362 --> 01:14:08.779
and as you compute each conv output, those outputs are now connected
to every layer after it, and these are all concatenated

01:14:08.779 --> 01:14:18.643
as input to each conv layer, and they have some other
processes for reducing the dimensions and keeping this efficient.
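
A small NumPy sketch of that connectivity (toy "layers" that each emit a fixed two channels stand in for real conv layers with a growth rate; the shapes are illustrative): every layer receives the concatenation of the block input and all earlier outputs.

```python
import numpy as np

# Dense block: each layer sees the channel-wise concatenation of the block
# input and every earlier layer's output; the block output concatenates all.
def dense_block(x, layers):
    features = [x]
    for layer in layers:
        inp = np.concatenate(features, axis=-1)  # all feature maps so far
        features.append(layer(inp))
    return np.concatenate(features, axis=-1)

# Toy "layers" that each emit 2 channels whatever their input depth.
toy_layer = lambda inp: np.ones(inp.shape[:-1] + (2,))
out = dense_block(np.zeros((4, 4, 3)), [toy_layer] * 3)
print(out.shape)  # (4, 4, 9): 3 input channels + 3 layers x 2 channels each
```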

01:14:18.643 --> 01:14:30.863
And so their main takeaway is that they argue this alleviates the
vanishing gradient problem, because you have all of these very dense connections.

01:14:30.863 --> 01:14:37.324
It strengthens feature propagation and also encourages
feature reuse, right, because there are so many of these

01:14:37.324 --> 01:14:45.487
connections that each feature map you're learning is input
to multiple later layers and is being used multiple times.

01:14:47.906 --> 01:15:03.006
So these are just a couple of ideas for what we can do that's not ResNet and yet still performs
comparably to or better than ResNets, and so this is another very active area of current research.

01:15:03.006 --> 01:15:11.830
You can see that a lot of this is looking at how different layers
are connected to each other and how depth is managed in these networks.

01:15:13.528 --> 01:15:17.991
And so one last thing that I wanted to
mention quickly, is just efficient networks.

01:15:17.991 --> 01:15:33.994
So this idea of efficiency: you saw that GoogleNet was a work looking in this direction of how we can have efficient
networks, which is important for a lot of practical usage, both training and especially deployment, and so this is

01:15:33.994 --> 01:15:37.927
another recent network
that's called SqueezeNet

01:15:37.927 --> 01:15:41.618
which is looking at very efficient networks.
They have these things called fire modules,

01:15:41.618 --> 01:15:49.645
which consist of a squeeze layer with a lot of one by one filters, which then
feeds into an expand layer with one by one and three by three filters,

01:15:49.645 --> 01:15:59.220
and they're showing that with this kind of architecture they're able to get
AlexNet level accuracy on ImageNet, but with 50 times fewer parameters,

01:15:59.220 --> 01:16:06.093
and then you can further do network compression on
this to get up to 500 times smaller than AlexNet

01:16:06.093 --> 01:16:10.095
and just have the whole
network just be 0.5 megs.
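
A rough weight count showing where those savings come from (the channel sizes here are illustrative, not SqueezeNet's exact configuration):

```python
# Weights in a conv layer: kernel_size^2 x input_depth x output_depth.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

c_in, squeeze, expand = 128, 16, 64
# Fire module: squeeze with 1x1 filters, then expand with 1x1 and 3x3 filters.
fire = (conv_params(1, c_in, squeeze)
        + conv_params(1, squeeze, expand)
        + conv_params(3, squeeze, expand))
# A single plain 3x3 conv producing the same 128 output channels.
plain = conv_params(3, c_in, 2 * expand)
print(fire, plain)  # 12288 147456: the fire module needs ~12x fewer weights
```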

01:16:10.095 --> 01:16:20.062
And so this is a direction, efficient networks and model compression, that
we'll cover more in a later lecture, but I'm just giving you a hint of it here.

01:16:21.856 --> 01:16:26.809
OK so today in summary we've talked about
different kinds of CNN Architectures.

01:16:26.809 --> 01:16:31.555
We looked in-depth at four of the main
architectures that you'll see in wide usage.

01:16:31.555 --> 01:16:35.553
AlexNet, one of the early,
very popular networks.

01:16:35.553 --> 01:16:38.832
VGG and GoogleNet which
are still widely used.

01:16:38.832 --> 01:16:45.906
But ResNet is kind of taking over as the thing that
you should be looking at most when you can.

01:16:45.906 --> 01:16:50.587
We also looked at a number of other
networks in a brief, high-level overview.

01:16:51.921 --> 01:16:58.228
And so the takeaway is that these models are available in a lot
of frameworks, so you can use them when you need them.

01:16:58.228 --> 01:17:06.827
There's a trend toward extremely deep networks, but there's also
significant research now around the design of how do we connect layers,

01:17:06.827 --> 01:17:15.419
skip connections, what is connected to what, and also using
these to design your architecture to improve gradient flow.

01:17:15.419 --> 01:17:22.748
There's an even more recent trend towards examining the necessity
of depth versus width, and of residual connections.

01:17:22.748 --> 01:17:31.380
Trade-offs, and what's actually helping, and so there are a lot of these recent works in
this direction that you can look into, some of the ones I pointed out, if you are interested.

01:17:31.380 --> 01:17:33.597
And next time we'll talk about
Recurrent neural networks.